Propensity Score Analysis with Machine Learning

HAD 7002H Causal Inference Spring/Summer 2024

Author

Kuan Liu

Dataset - Right Heart Catheterization (RHC)

For this tutorial, we will be using the same right heart catheterization (RHC) dataset.

Data import and processing

library(tidyverse)
options(scipen = 999)
data <- read.csv("data/rhc.csv", header=T)

# define exposure variable
data$A <- ifelse(data$swang1 =="No RHC", 0, 1)

# outcome is based on dth30, death within 30 days (Y = 1 died, Y = 0 survived);
data$Y <- ifelse(data$dth30 =="No", 0, 1)

Finalizing dataset for causal analysis

# we create our analysis data by removing variables with a large proportion of missing values
# and variables not used in the analysis;
data2 <- select(data, -c(cat2, adld3p, urin1, swang1,
                         sadmdte, dschdte, dthdte, lstctdte, death, dth30,
                         surv2md1, das2d3pc, t3d30, ptid)) 
data2 <- rename(data2, id = X)

Propensity score analysis using machine learning techniques

1 Super (Machine) Learning

  • List of machine learning algorithms available in the SuperLearner R package
library(SuperLearner)
listWrappers()
 [1] "SL.bartMachine"      "SL.bayesglm"         "SL.biglasso"        
 [4] "SL.caret"            "SL.caret.rpart"      "SL.cforest"         
 [7] "SL.earth"            "SL.gam"              "SL.gbm"             
[10] "SL.glm"              "SL.glm.interaction"  "SL.glmnet"          
[13] "SL.ipredbagg"        "SL.kernelKnn"        "SL.knn"             
[16] "SL.ksvm"             "SL.lda"              "SL.leekasso"        
[19] "SL.lm"               "SL.loess"            "SL.logreg"          
[22] "SL.mean"             "SL.nnet"             "SL.nnls"            
[25] "SL.polymars"         "SL.qda"              "SL.randomForest"    
[28] "SL.ranger"           "SL.ridge"            "SL.rpart"           
[31] "SL.rpartPrune"       "SL.speedglm"         "SL.speedlm"         
[34] "SL.step"             "SL.step.forward"     "SL.step.interaction"
[37] "SL.stepAIC"          "SL.svm"              "SL.template"        
[40] "SL.xgboost"         
[1] "All"
[1] "screen.corP"           "screen.corRank"        "screen.glmnet"        
[4] "screen.randomForest"   "screen.SIS"            "screen.template"      
[7] "screen.ttest"          "write.screen.template"

2 Using machine learning methods with PSA

  • We can use machine learning (ML) to fit the propensity score model
  • Machine learning methods are more flexible than parametric methods (e.g., logistic regression)
    • This flexibility is not without cost: the more flexible the method, the greater the risk of overfitting. Fitting too much noise often results in poor coverage probability.
    • No single approach outperforms all others, so the choice of method should be evaluated case by case.
    • ML is generally recommended for sufficiently large cohorts and for modelling a large set of covariates.
    • It is advisable to report results from the conventional logistic regression approach as a sensitivity analysis alongside the ML approaches.
  • Many approaches are included in the WeightIt package, https://ngreifer.github.io/WeightIt/reference/method_super.html
    • “gbm”, Propensity score weighting using generalized boosted modeling (also known as gradient boosting machines)
    • “super”, Propensity score weighting using SuperLearner
    • “bart”, Propensity score weighting using Bayesian additive regression trees (BART)
      • Bayesian Additive Regression Trees (BART) is a sum-of-trees model for approximating an unknown function. To avoid the overfitting common to a single decision tree, BART uses a regularization prior that forces each tree to explain only a limited part of the relationship between the covariates and the outcome.

Setting up data and PS model formula

require(tidyverse)
require(WeightIt)

covariates <- select(data2, -c(id, A, Y))
baselines <- colnames(covariates)
ps.formula <- as.formula(paste("A~", 
                paste(baselines, collapse = "+")))
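
As a baseline for the sensitivity analysis recommended above, the conventional logistic-regression weights can be obtained through the same interface (a sketch; in current versions of WeightIt, method = "glm" requests a plain logistic-regression propensity score model):

```r
# baseline: conventional logistic-regression propensity score weights;
# method = "glm" fits plain (parametric) logistic regression;
IPTW_glm <- weightit(ps.formula,
                     data = data2,
                     method = "glm",
                     stabilize = TRUE)
summary(IPTW_glm)
```

These weights can then be compared against the ML-based weights fitted below.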

PS model with gradient boosting

  • Gradient boosting is computationally more demanding; this model might take several minutes to run.
IPTW_gbm <- weightit(ps.formula,
                 data = data2,
                 method = "gbm",
                 stabilize = TRUE)
# saving the model output as an R object to avoid rerunning the same model;
saveRDS(IPTW_gbm, file = "data/IPTW_gbm")
# reading saved model output;
require(sjPlot)
IPTW_gbm <- readRDS(file = "data/IPTW_gbm")
summary(IPTW_gbm)
                 Summary of weights

- Weight ranges:

           Min                                   Max
treated 0.3874 |---------------------------| 10.9875
control 0.6227  |------------|                5.6764

- Units with the 5 most extreme weights by group:
                                            
           5525   3148   4890   1825    5131
 treated 4.0556 4.6873 5.1931 5.3152 10.9875
            497   2046   3830   1602    1000
 control 4.6689 4.8977  4.929 4.9313  5.6764

- Weight statistics:

        Coef of Var   MAD Entropy # Zeros
treated       0.663 0.397   0.143       0
control       0.426 0.262   0.067       0

- Effective Sample Sizes:

           Control Treated
Unweighted 3551.   2184.  
Weighted   3006.24 1517.47
fit2_gbm <- glm(Y ~ A, 
            family = "binomial",
            weights = IPTW_gbm$weights,
            data = data2)
tab_model(fit2_gbm)
               Y
Predictors     Odds Ratios   CI            p
(Intercept)    0.46          0.42 – 0.49   <0.001
A              1.27          1.12 – 1.43   <0.001
Observations   5735
R2 Tjur        0.004
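
A caveat on the table above: glm() treats the IPTW weights as fixed, so model-based confidence intervals can be too narrow. A robust (sandwich) variance is generally recommended; a minimal sketch, assuming the sandwich and lmtest packages are installed:

```r
# robust (sandwich) standard errors for the weighted outcome model;
library(sandwich)
library(lmtest)
coeftest(fit2_gbm, vcov = vcovHC(fit2_gbm, type = "HC0"))
```

The same adjustment applies to the fit2_SL and fit2_bart models below.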

PS model with Super Learner

IPTW_SL <- weightit(ps.formula,
                 data = data2,
                 method = "super",
                 SL.library=c("SL.randomForest", "SL.glmnet", "SL.nnet"), 
                 stabilize = TRUE)
# saving the model output as an R object to avoid rerunning the same model;
saveRDS(IPTW_SL, file = "data/IPTW_SL")
# reading saved model output;
IPTW_SL <- readRDS(file = "data/IPTW_SL")
summary(IPTW_SL)
                 Summary of weights

- Weight ranges:

           Min                                  Max
treated 0.4008 |-----------------|           1.0346
control 0.6254        |--------------------| 1.4140

- Units with the 5 most extreme weights by group:
                                           
            742   5131   3508   1825   4890
 treated 0.9223 0.9489  0.968 0.9808 1.0346
           3391   3216   2174   2986    505
 control 1.3373 1.3461 1.3533 1.3627  1.414

- Weight statistics:

        Coef of Var   MAD Entropy # Zeros
treated       0.184 0.147   0.016       0
control       0.159 0.125   0.012       0

- Effective Sample Sizes:

           Control Treated
Unweighted 3551.    2184. 
Weighted   3463.76  2112.6
fit2_SL <- glm(Y ~ A, 
            family = "binomial",
            weights = IPTW_SL$weights,
            data = data2)
tab_model(fit2_SL)
               Y
Predictors     Odds Ratios   CI            p
(Intercept)    0.45          0.41 – 0.48   <0.001
A              1.35          1.17 – 1.56   <0.001
Observations   5735
R2 Tjur        0.005
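
The effective sample sizes reported by summary() follow Kish's approximation, \((\sum w)^2 / \sum w^2\), within each treatment group. They can be reproduced by hand (a sketch, using the Super Learner weights):

```r
# Kish's approximate effective sample size: (sum of weights)^2 / (sum of squared weights);
kish_ess <- function(w) sum(w)^2 / sum(w^2)
tapply(IPTW_SL$weights, data2$A, kish_ess)
```

With equal weights the ESS equals the group size; the more variable the weights, the larger the drop.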

PS model with Bayesian additive regression trees

  • A much faster algorithm compared to gbm and Super Learner.
IPTW_bart <- weightit(ps.formula,
                 data = data2,
                 method = "bart",
                 stabilize = TRUE)
# saving the model output as an R object to avoid rerunning the same model;
saveRDS(IPTW_bart, file = "data/IPTW_bart")
# reading saved model output;
IPTW_bart <- readRDS(file = "data/IPTW_bart")
summary(IPTW_bart)
                 Summary of weights

- Weight ranges:

           Min                                   Max
treated 0.3861 |---------------------------| 13.8560
control 0.6211 |------------------|           9.8333

- Units with the 5 most extreme weights by group:
                                             
           3148   4119    1825    5131   4890
 treated 8.7361 9.5095 11.8258 12.7766 13.856
             84   1810    3839    1000    505
 control 6.7109 6.7279  7.9066  7.9218 9.8333

- Weight statistics:

        Coef of Var   MAD Entropy # Zeros
treated       0.926 0.471   0.226       0
control       0.544 0.308   0.096       0

- Effective Sample Sizes:

           Control Treated
Unweighted 3551.   2184.  
Weighted   2740.56 1175.59
fit2_bart <- glm(Y ~ A, 
            family = "binomial",
            weights = IPTW_bart$weights,
            data = data2)
tab_model(fit2_bart)
               Y
Predictors     Odds Ratios   CI            p
(Intercept)    0.45          0.42 – 0.48   <0.001
A              1.30          1.16 – 1.46   <0.001
Observations   5735
R2 Tjur        0.005
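
With all three weight fits in hand, their variability (the coefficient of variation shown in each summary) can be compared side by side; less variable weights generally preserve more of the effective sample size. A sketch, assuming the three objects above are still in memory:

```r
# compare overall weight variability across the three ML approaches;
sapply(list(gbm   = IPTW_gbm$weights,
            super = IPTW_SL$weights,
            bart  = IPTW_bart$weights),
       function(w) sd(w) / mean(w))
```

Here the Super Learner weights are the least variable, consistent with their larger effective sample sizes.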

TMLE in Steps

set.seed(123)
library(xgboost)
library(tmle)
data2.noY <- select(data2, -c(id, Y))

1. Initial G-formula estimate using SuperLearner

Y.fit.sl <- SuperLearner(Y=data2$Y, 
                       X=data2.noY, 
                       cvControl = list(V = 3),
                       SL.library=c("SL.glm", 
                                    "SL.glmnet", 
                                    "SL.xgboost"),
                       method="method.CC_nloglik", 
                       family="binomial")
saveRDS(Y.fit.sl, file = "data/SL_TMLE_G1")

2. Getting predicted Y

# reading saved SL G-formula output;
Y.fit.sl <- readRDS(file = "data/SL_TMLE_G1")
# predict Y under the observed A;
data2$init.Pred <- predict(Y.fit.sl, newdata = data2.noY, 
                           type = "response")$pred
summary(data2$init.Pred)
       V1         
 Min.   :0.01274  
 1st Qu.:0.15575  
 Median :0.26426  
 Mean   :0.32912  
 3rd Qu.:0.46196  
 Max.   :0.98741  
#treated;
data2.noY$A <- 1
data2$Pred.Y1 <- predict(Y.fit.sl, newdata = data2.noY, 
                           type = "response")$pred
summary(data2$Pred.Y1)
       V1        
 Min.   :0.0163  
 1st Qu.:0.1822  
 Median :0.2962  
 Mean   :0.3608  
 3rd Qu.:0.5086  
 Max.   :0.9912  
#control;
data2.noY$A <- 0
data2$Pred.Y0 <- predict(Y.fit.sl, newdata = data2.noY, 
                           type = "response")$pred
summary(data2$Pred.Y0)
       V1         
 Min.   :0.01274  
 1st Qu.:0.14496  
 Median :0.23888  
 Mean   :0.30848  
 3rd Qu.:0.43069  
 Max.   :0.98741  
# Compute initial treatment effect estimate;
data2$Pred.TE <- data2$Pred.Y1 - data2$Pred.Y0   
summary(data2$Pred.TE) 
       V1          
 Min.   :0.003554  
 1st Qu.:0.034483  
 Median :0.051963  
 Mean   :0.052269  
 3rd Qu.:0.067073  
 Max.   :0.142908  
plot(density(data2$Pred.TE))

3. Fitting Treatment Model (PS model)

set.seed(124)
#covariates are variables not including A & Y;
# covariates <- select(data2, -c(id, A, Y))
PS.fit.SL <- SuperLearner(Y=data2$A, 
                       X=covariates, 
                       cvControl = list(V = 3),
                       SL.library=c("SL.glm", 
                                    "SL.glmnet", 
                                    "SL.xgboost"),
                       method="method.CC_nloglik",
                       family="binomial")
saveRDS(PS.fit.SL, file = "data/SL_TMLE_PS")

Predict propensity scores

# reading saved SL PS output;
PS.fit.SL <- readRDS(file = "data/SL_TMLE_PS")

# predict A = 1 given covariates;
all.pred <- predict(PS.fit.SL, type = "response")
data2$PS.SL <- all.pred$pred 

tapply(data2$PS.SL, data2$A, summary)
$`0`
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.003699 0.083554 0.178727 0.219162 0.320084 0.862771 

$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05175 0.49753 0.65254 0.62832 0.78082 0.98436 
plot(density(data2$PS.SL[data2$A==0]), 
     col = "red", main = "")
lines(density(data2$PS.SL[data2$A==1]), 
      col = "blue", lty = 2)
legend("topright", c("No RHC","RHC"), 
       col = c("red", "blue"), lty=1:2) 

4. Estimate clever covariate H

data2$H.A1L <- (data2$A) / data2$PS.SL 
data2$H.A0L <- (1-data2$A) / (1- data2$PS.SL)
data2$H.AL <- data2$H.A1L - data2$H.A0L
summary(data2$H.AL)
       V1        
 Min.   :-7.287  
 1st Qu.:-1.296  
 Median :-1.071  
 Mean   :-0.143  
 3rd Qu.: 1.368  
 Max.   :19.323  
tapply(data2$H.AL, data2$A, summary)
$`0`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -7.287  -1.471  -1.218  -1.369  -1.091  -1.004 

$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.016   1.281   1.532   1.850   2.010  19.323 

5. Estimate the fluctuation parameter \(\epsilon\)

eps_mod <- glm(Y ~ -1 + H.AL +  
                 offset(qlogis(init.Pred)), 
               family = "binomial",
               data = data2)
epsilon <- coef(eps_mod)  
epsilon
       H.AL 
-0.00181726 

\(\epsilon\) is very small, so further updates to the initial estimates are unlikely to change them appreciably.

6. Update the initial predictions using \(\epsilon\)

# update the counterfactual predictions with the clever covariate evaluated
# at A = 1, H(1,L) = 1/PS, and at A = 0, H(0,L) = -1/(1 - PS);
data2$Pred.Y1.update1 <- plogis(qlogis(data2$Pred.Y1) +
                                   epsilon / data2$PS.SL)
data2$Pred.Y0.update1 <- plogis(qlogis(data2$Pred.Y0) -
                                   epsilon / (1 - data2$PS.SL))
summary(data2$Pred.Y1.update1)
       V1         
 Min.   :0.01633  
 1st Qu.:0.18213  
 Median :0.29582  
 Mean   :0.36079  
 3rd Qu.:0.50856  
 Max.   :0.99124  
summary(data2$Pred.Y0.update1) 
       V1         
 Min.   :0.01277  
 1st Qu.:0.14505  
 Median :0.23922  
 Mean   :0.30852  
 3rd Qu.:0.43000  
 Max.   :0.98744  
data2$update.Pred <- ifelse(data2$A==1, data2$Pred.Y1.update1, data2$Pred.Y0.update1)

eps_mod1 <- glm(Y ~ -1 + H.AL +  
                 offset(qlogis(update.Pred)), 
               family = "binomial",
               data = data2)
epsilon1 <- coef(eps_mod1)  
epsilon1
                   H.AL 
0.000000000000000197404 
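
In general, steps 5 and 6 can be repeated until the fluctuation parameter is negligible. A generic sketch of such a targeting loop (illustrative only; it assumes a data frame with columns Y, H.AL, and a current prediction column update.Pred):

```r
# refit the fluctuation model and update the predictions on the logit
# scale until the fluctuation parameter falls below a tolerance;
target_until_converged <- function(data, tol = 1e-8, max.iter = 10) {
  for (i in seq_len(max.iter)) {
    fit <- glm(Y ~ -1 + H.AL + offset(qlogis(update.Pred)),
               family = "binomial", data = data)
    eps <- coef(fit)
    # shift the current predictions by eps times the clever covariate;
    data$update.Pred <- plogis(qlogis(data$update.Pred) + eps * data$H.AL)
    if (abs(eps) < tol) break
  }
  data
}
```

For our data convergence is immediate: epsilon1 above is numerically zero after a single update, so no further iterations are needed.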

7. Estimate ATE

data2$Pred.Y1.update2 <- plogis(qlogis(data2$Pred.Y1.update1) +
                                   epsilon1 / data2$PS.SL)
data2$Pred.Y0.update2 <- plogis(qlogis(data2$Pred.Y0.update1) -
                                   epsilon1 / (1 - data2$PS.SL))
summary(data2$Pred.Y1.update2)
       V1         
 Min.   :0.01633  
 1st Qu.:0.18213  
 Median :0.29582  
 Mean   :0.36079  
 3rd Qu.:0.50856  
 Max.   :0.99124  
summary(data2$Pred.Y0.update2) 
       V1         
 Min.   :0.01277  
 1st Qu.:0.14505  
 Median :0.23922  
 Mean   :0.30852  
 3rd Qu.:0.43000  
 Max.   :0.98744  
# Risk difference;
ATE <- data2$Pred.Y1.update2 -  data2$Pred.Y0.update2
summary(ATE)
       V1         
 Min.   :0.00356  
 1st Qu.:0.03446  
 Median :0.05198  
 Mean   :0.05227  
 3rd Qu.:0.06707  
 Max.   :0.14291  
ATE.TMLE<-mean(ATE)

8. Estimate confidence interval using influence curve

ci.estimate <- function(data = data2){

  # plug-in estimates of the counterfactual means;
  EY1 <- mean(data$Pred.Y1.update2, na.rm = TRUE)
  EY0 <- mean(data$Pred.Y0.update2, na.rm = TRUE)
  ATE.TMLE <- EY1 - EY0
  # ATE efficient influence curve
  D1 <- data$A/data$PS.SL*
    (data$Y - data$Pred.Y1.update2) + 
    data$Pred.Y1.update2 - EY1
  D0 <- (1 - data$A)/(1 - data$PS.SL)*
    (data$Y - data$Pred.Y0.update2) + 
    data$Pred.Y0.update2 - EY0
  EIC <- D1 - D0
  # ATE variance
  n <- nrow(data)
  varHat.IC <- var(EIC, na.rm = TRUE)/n
  # ATE 95% CI
  ATE.TMLE.CI <- c(ATE.TMLE - 1.96*sqrt(varHat.IC), 
                   ATE.TMLE + 1.96*sqrt(varHat.IC))
  return(ATE.TMLE.CI) 
}

print(c(mean(ATE), ci.estimate(data = data2)))
[1] 0.05227224 0.03604141 0.06850306

TMLE using the tmle R package

# run tmle from the tmle package 
SL.library = c("SL.glm", 
               "SL.glmnet", 
               "SL.xgboost")

tmle.fit <- tmle(Y = data2$Y, 
                   A = data2$A, 
                   W = covariates, 
                   family = "binomial",
                   V.Q = 3, #outcome model;
                   V.g = 3, #treatment model;
                   Q.SL.library = SL.library, 
                   g.SL.library = SL.library)

saveRDS(tmle.fit, file = "data/SL_TMLE")

ATE estimated using tmle package

# reading saved TMLE output;
tmle.fit <- readRDS(file = "data/SL_TMLE")
summary(tmle.fit)
 Initial estimation of Q
     Procedure: cv-SuperLearner, ensemble
     Model:
         Y ~  SL.glm_All + SL.glmnet_All + SL.xgboost_All

     Coefficients: 
          SL.glm_All    0.02185999 
       SL.glmnet_All    0.7663823 
      SL.xgboost_All    0.2117577 

     Cross-validated pseudo R squared :  0.1546 

 Estimation of g (treatment mechanism)
     Procedure: SuperLearner, ensemble
     Model:
         A ~  SL.glm_All + SL.glmnet_All + SL.xgboost_All 

     Coefficients: 
          SL.glm_All    0 
       SL.glmnet_All    0.624174 
      SL.xgboost_All    0.375826 

 Estimation of g.Z (intermediate variable assignment mechanism)
     Procedure: No intermediate variable 

 Estimation of g.Delta (missingness mechanism)
     Procedure: No missingness, ensemble

 Bounds on g: (0.0076, 1) 

 Bounds on g for ATT/ATC: (0.0076, 0.9924) 

 Marginal Mean under Treatment (EY1)
   Parameter Estimate:  0.36695
   Estimated Variance:  0.000054246
              p-value:  <2e-16
    95% Conf Interval:  (0.35252, 0.38139)

 Marginal Mean under Comparator (EY0)
   Parameter Estimate:  0.31205
   Estimated Variance:  0.0000468
              p-value:  <2e-16
    95% Conf Interval:  (0.29865, 0.32546)

 Additive Effect
   Parameter Estimate:  0.054898
   Estimated Variance:  0.000086137
              p-value:  0.0000000033172
    95% Conf Interval:  (0.036707, 0.073088)

 Additive Effect among the Treated
   Parameter Estimate:  0.21568
   Estimated Variance:  0.0010373
              p-value:  0.000000000021354
    95% Conf Interval:  (0.15255, 0.2788)

 Additive Effect among the Controls
   Parameter Estimate:  -0.074178
   Estimated Variance:  0.000071118
              p-value:  <2e-16
    95% Conf Interval:  (-0.090707, -0.05765)

 Relative Risk
   Parameter  Estimate:  1.1759
   Variance(log scale):  0.00075326
               p-value:  0.0000000035356
     95% Conf Interval:  (1.1143, 1.2409)

 Odds Ratio
    Parameter  Estimate:  1.2779
    Variance(log scale):  0.0017218
                p-value:  0.000000003428
      95% Conf Interval:  (1.1781, 1.3862)